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Abstract 

Background: Multi-site health sciences research is becoming more common, as it enables investigation of rare 
outcomes and diseases and new healthcare innovations. Multi-site research usually involves the transfer of large 
amounts of research data between collaborators, which increases the potential for accidental disclosures of 
protected health information (PHI). Standard protocols for preventing release of PHI are extremely vulnerable to 
human error, particularly when the shared data sets are large. 

Methods: To address this problem, we developed an automated program (SAS macro) to identify possible PHI in 
research data before it is transferred between research sites. The macro reviews all data in a designated directory to 
identify suspicious variable names and data patterns. The macro looks for variables that may contain personal 
identifiers such as medical record numbers and social security numbers. In addition, the macro identifies dates and 
numbers that may identify people who belong to small groups, who may be identifiable even in the absences of 
traditional identifiers. 

Results: Evaluation of the macro on 100 sample research data sets indicated a recall of 0.98 and precision of 0.81. 

Conclusions: When implemented consistently, the macro has the potential to streamline the PHI review process 
and significantly reduce accidental PHI disclosures. 
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Background 

Studying rare outcomes, new interventions, and diverse 
populations frequently requires collaborations across mul- 
tiple healthcare institutions. As the capacity to exchange 
health research data grows through the development of 
distributed research networks, healthcare research col- 
laboratories, and computing grids, the amount of new 
public health research involving partnerships across aca- 
demic institutions, healthcare delivery systems, insurance 
providers and pharmaceutical companies is also growing 
[1]. Existing large-scale multi-site research and public 
health collaborations include HMO Research Network 
(HMORN) [2] based projects such as the Cancer and 
Cardiovascular Research Networks, the Vaccine Safety 
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Datalink [3], and the Centers for Education and Research 
on Therapeutics [4], the FDA Sentinel project [5] and the 
Scalable PArtnering Network (SPAN) [6], among others. 
These collaborations often require the release of ag- 
gregated patient data or fully or partially de-identified 
patient-level information from participating institutions 
to the lead research site. Occasionally fully-identifiable 
patient information is required, subject to Institutional 
Review Board (IRB) approval and appropriate Data Use 
Agreements (DUAs). 

In accordance with the Privacy and Security Regulations 
of the Health Insurance Portability and Accountability Act 
of 1996 (HIPAA), collaborating institutions work closely 
with their IRB to ensure that protected health information 
(PHI) used in research meets the "minimum necessary" 
requirements and has all appropriate safeguards [7]. As 
public health research collaborations grow more common, 
the potential for accidental disclosure of PHI also grows. 
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Over the last several years, the authors have been aware of 
several accidental disclosures including temporary data 
that were accidentally released to the lead research site, 
multi-site extraction programs that failed to completely 
replace patient identifiers with de-identified study ids, and 
programmers who forgot to remove one or more of the 18 
patient identifiers described by the Privacy Rule from the 
final data sets. Given the tight deadlines and complex data 
sets often required by multi-site research, accidental inclu- 
sion of PHI in research data sets is a real concern. Methods 
that make it easier to identify PHI and determine whether 
it is authorized for a given research project could signifi- 
cantly reduce accidental unauthorized PHI disclosures. 

Effective PHI management requires efforts at multiple 
levels including national and organizational policy inter- 
pretation, access control, and control over data release. 
Ontology-based models have been developed to imple- 
ment national and organizational policy as rule-based 
systems that control data access based on a complex 
interrelationship between the data user, the patient, and 
the purpose of the data use [8]. These systems control 
who has access to the data using metadata attached to 
the data elements. At the data release level, the goal is 
generally to release the least amount of personal data 
necessary to achieve the scientific objective. To that end, 
there have been several efforts to develop methods to 
scrub PHI from electronic health records to make them 
more accessible for research. Most de-identification 
techniques have focused on removing all PHI from text 
such as discharge notes, radiology or pathology reports, 
or progress notes (see [9,10] for reviews) in order to 
allow the text to be shared with collaborators or other 
researchers. These techniques focus on scrubbing all 
PHI elements from the medical record data through 
either lexical, heuristic or pattern based systems or 
machine learning approaches. However, many collabora- 
tive research data sets consist primarily of structured 
data, which may or may not be allowed to contain PHI 
elements depending on the IRB agreements. For ex- 
ample, some elements of PHI, such as names or medical 
record numbers, may need to be removed from the 
research data sets, while other elements such as birth 
date and gender may be allowed to support the research. 
In these cases, protecting PHI during research may 
primarily require comprehensive review of the research 
data sets prior to disclosure to ensure that only allo- 
wable PHI is included. 

Existing PHI review techniques primarily rely on 
humans to review the data and interpret and apply the 
PHI restrictions correctly. However, reviewing large, 
complex data sets can be difficult: study data sets 
frequently include thousands to millions of records and 
its unrealistic to expect the data reviewer to review 
every record for data that may qualify as PHI. Methods 



that reduce the burden on the data reviewer by automa- 
tically identifying possible PHI in collaborative data sets 
have the potential to significantly reduce the probability 
of accidental PHI disclosure. 

In this work, our goal was to create an automated 
process that would reduce the manual effort of checking 
the research data sets for PHI. The purpose was not to 
scrub the data sets of PHI, but to alert the researchers of 
PHI contained in the data sets for cross-checking against 
IRB and DUAs. We formed a group of 5 experienced 
multi-site programmers (PHI Work Group) to identify 
and address the most common causes of accidental PHI 
disclosure. The group identified five major problems that 
may lead to accidental disclosure of PHI: 

1. Inclusion of data sets meant to be retained locally in 
the data that gets transferred to the lead site. 

2. Failure to substitute a study-id for patient identifiers. 

3. Failure to "scrub" patient identifiers, such as medical 
record numbers and social security numbers from 
the data set prior to transfer. 

4. Inclusion of dates that indicate rare characteristics, 
such as advanced age (i.e. age > 89) 

5. Indications of small populations with rare disorders. 

To address these issues, we developed a macro to 
identify PHI in SAS data sets prior to data release. 

Methods 

Data environment 

The HMORN is a consortium of 19 health care delivery 
organizations that conducts collaborative research on a 
wide variety of healthcare topics [2]. To facilitate colla- 
borative research across disparate healthcare delivery 
organizations, the HMORN has developed a set of stan- 
dardized data specifications for a virtual data warehouse 
(VDW) [11]. To obtain data for multi-site research pro- 
jects, HMORN analysts at the lead research site develop 
and distribute SAS scripts using common variable names 
to reference the standardized data structures. Participa- 
ting sites run the scripts within their own environment 
and transfer the resulting data sets to the lead site for 
final analysis. The transfer data sets can range from 
aggregate counts to patient-level data about encounters, 
diagnoses and procedures, prescriptions, and lab test 
results depending on the research needs, the DUA and 
the IRB agreement. 

Transfer of data from participating sites to the lead 
research site represents the greatest risk of inappropriate 
PHI disclosure during the research project. Every effort 
is made to restrict the transferred data to the minimum 
necessary for complete and accurate study analysis, in- 
cluding de-identification, assigning study IDs, and redac- 
ting counts that may identify small populations with rare 
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diagnoses or procedures (referred to as "small cell sizes"). 
To ensure transfer data sets include appropriate data only, 
all data sets must be reviewed prior to transfer to ensure 
that they do not contain PHI beyond what is allowed 
under data sharing and IRB agreements. Data set review 
involves identifying and removing any disallowed vari- 
ables, as well as reviewing the data itself for individual 
instances of PHI. This is particularly difficult for items 
such as small cell sizes and ages greater than 89, which 
can be buried in a single record of a massive data set. 
Manually checking every data set to make sure it does not 
contain unauthorized data, including PHI, can be a time 
consuming and error-prone process. 



Approach 

The three project requirements were: 1) the PHI identifi- 
cation process must identify the most common forms of 
PHI in structured data; 2) the process must be fast to 
implement to allow quick turnaround; and 3) it must 
leave the decision about whether PHI is allowable in the 
hands of research personnel. Based on these require- 
ments, the PHI Workgroup developed an automated 
program (macro) designed to identify potential PHI in a 
SAS analytic environment, referred to as the PHI Detec- 
tion macro. The macro evaluates the directory containing 
files to be transferred to the lead programming site (trans- 
fer directory), identifies SAS data sets, and scans each SAS 
data set for possible PHI elements. The macro uses a pat- 
tern matching approach and is designed to work on data 
sets that contain structured data. It relies primarily on 
regular expressions to identify patterns consistent with 
medical record numbers of social security numbers, field 
formats to identify dates, and keywords as fieldnames. 

The PHI detection macro is designed to run in SAS 
9.1 or higher. The macro analyzes the data in the trans- 
fer directory at two levels: a high-level overview of all 
files in the directory for comparison against the pro- 
gramming workplan, and a detailed analysis of the data 
in each data set. The results of all analyses are printed to 
a PDF report for review prior to data transfer. The code 
for the PHI detection macro can be downloaded from 
https://github.com/HMORN/phi_macros. 



The high-level analysis consists of a scan of all files in 
the transfer directory. The macro creates a summary 
listing of each file type and the number of files matching 
that file type in the directory. It also creates a directory 
listing of each file in the directory, including the name 
and file type. For SAS data sets, the directory listing also 
indicates the creation date, modification date and the 
number of records in the data set. The detailed analysis 
provides four separate checks on each data set. The first 
data check looks for variable names that may indicate 
common PHI elements such as personal identifiers, birth 
dates, health encounter dates and death dates. Site- 
specific variable names that may indicate PHI can be 
specified as a pre-defined global variable. For example, 
sites which use the variable med_rec_no to refer to a a 
patient's medical record number may want to ensure 
that med_rec_no does not occur in transfer data sets. 
All variable name checks are case insensitive. Table 1 
provides examples of some of the strings evaluated in 
this data check. An example of a site-specific string is 
shown in the last row of Table 1. 

The second data check performed on each transfer 
data set evaluates strings in the data set to determine if 
the data matches a regular expression that represents 
the sites standard personal identifiers (i.e. Medical 
Record Numbers). Regular expressions are a highly flex- 
ible method for defining text strings to be used in string 
comparisons. For example, a medical record number 
that consists of 8 to 9 numeric characters would be 
defined as "( A \d{8, 9}\s)", and would match the string 
"12345678" in the data set. The macro uses the SAS 
function prxmatch to compare character data in the data 
set to the regular expression. The macro can be custom- 
ized to evaluate every record, or to restrict the evaluation 
to a specific number of records to improve processing 
speed. 

A third data check evaluates the transfer data sets con- 
tain dates that may reflect birth, death or healthcare 
utilization dates. The data check further evaluates the 
contents of all date variables to determine whether the 
date may indicate a person over age 89. The macro uses 
89 as the default age, as specified in the Privacy Rule [7], 
but individual sites can override the default to flag 



Table 1 Example of strings that may indicate PHI if they are used as variable names 
Type of PHI 



Potential variable names 



Medical Record Number 

Subject Name 

Social Security Number 

Birth/death dates 

Health encounter dates 

Locally "forbidden" variable names 

This list is not exhaustive. 



MRN, Medical_Record_Number, Medical_Record_Num, Medical_Record_No, Med_Rec_No, Med_Record_Num, 

Name, First_Name, Last_Name, birth_name, maiden_name, Surname 

SSN, social_security_number, SocialSecuityNumber, socsec 

Birth_date, bday, birthDate, death_date, deathDay 

Encounter_dt, enc_dt, encDt 

Medical_history_number|MHN|Pat_ID|Pat_MRNJd 
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younger ages. Variables are considered date variables if 
either the variable format is a date type, or the variable 
name contains the word "date." For each date variable, 
the macro reviews all records to determine if the data 
set contains any dates that may indicate the patient was 
older than 89. The macro also looks for variable names 
containing the word "age" and looks for records that 
may identify individual patients who are older than 89. 

The purpose of the fourth data check is to identify 
small groups that may indicate rare conditions or treat- 
ments. According to the privacy rule [7], individuals 
with rare or uncommon diagnoses or conditions may be 
identifiable even when the 18 specific patient identifiers 
are removed [7] and therefore information about indi- 
viduals with such rare conditions should be considered 
PHI. The macro scans all numerical variables in the 
transfer data sets to identify values between 1 and 5 
(inclusive) and prints a report listing all numerical vari- 
ables with values between 1 and 5. 

The final step in the detail analysis is to print 5 sample 
records to the PHI report. The sample records allow the 
data reviewer to manually review all variable names and 
a subset of data to find any potential PHI the automated 
analysis may have missed. 

Evaluation 

We measured the performance of the phi detection macro 
by testing it against two types of data. The first test data 
set consisted of fake data that contained multiple examples 
of PHI: dates, names, addresses, medical record numbers, 
birth dates, and social security numbers. We ran the 
macro on the directory containing the fake PHI data and 
evaluated how many of the PHI types the macro identified. 

For the second test of the PHI macro, we created a 
collection of 100 data sets from previous research projects. 
The collection included data sets that were shared with 
other research sites, as well as data sets that were meant 
to be retained locally. Data sets in the collection contained 
a variable level of PHI: some data sets contained names, 
addresses and medical record numbers, while other data 
sets contained no PHI. We restricted each data set to 50 
records to facilitate manual review. 

We used manual review as our gold standard of PHI de- 
tection. Every data set was reviewed by two people to ensure 
that all examples of PHI were identified. For each instance 
of PHI, we captured the field name and type of PHI to a log. 
Once all test data sets had been hand reviewed, we executed 
the macro on the directory containing the collection of data 
sets, and cross-checked the PHI report against the PHI log 
generated from the hand review process. 

Results 

The HMORN PHI Workgroup developed a SAS macro 
designed to identify PHI in research data sets. The PHI 



detection macro scans the transfer directory to identify 
files to review and performs a detailed analysis of each 
SAS data set in the transfer directory. The macro pro- 
duces a PDF report that the site data reviewer, usually 
the research analyst or project manager, can review to 
ensure that any data to be released from the collabora- 
ting site is consistent with the data sharing and IRB 
agreements and does not contain unauthorized PHI. 
Examples of complete reports using fake patient data 
can be found at http://mapri.kaiserpermanente.org/re- 
search/mapri-sample-reports/. 

The PHI detection report contains two sections: the 
overview and the detail section. Figure 1 shows an 
example of the PHI detection report. The overview 
section provides a count of files by file type and a listing of 
all files in the transfer directory. Figure la illustrates a typ- 
ical example of the overview section of the PHI Detection 
report. The file listing includes a record count for all SAS 
data sets, as well as the date each data set was created and 
modified. Both the file count and the file listing can be 
compared to the expected output described in the pro- 
grams workplan to evaluate whether the program has 
produced the correct data sets and to ensure there are no 
unexpected files in the transfer directory. In addition, 
when the data sets contain individual-level data, the rec- 
ord count for population data sets can be compared to the 
estimated size of the target population to ensure that the 
program is identifying the appropriate population. 

The detail section of the PHI detection report 
(example shown in Figure lb) contains information 
about each data set in the transfer directory. The report 
contains three subsections for each data set: warnings 
regarding potential PHI the program has identified, sam- 
ple records, and information about low values of the 
numeric variables. The warnings subsection contains the 
results of all automatic PHI checks, including checks for 
"forbidden" variable names, comparisons of string data to 
potential study identifier formats, and checks for dates 
greater than the cutoff value. If any of these checks iden- 
tify data that may be PHI, a warning is printed to the 
report describing the potential problem. An example of 
these warnings is shown in the top section of Figure lb. 

The second subsection of the detailed analysis of each 
data set displays a small number of example records 
from the data set for manual inspection. In cases where 
potential problems have been identified such as variables 
that meet the defined pattern for a medical record num- 
ber, or records with small or zero numerical values, 
records containing the identified concerns are selected 
for review. In cases where the macro has not identified 
any specific concerns, five random records are selected 
for review. Figure lb shows the case where the macro 
has identified two potential concerns: records with data 
that contain patterns that are consistent with PHI, and 
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PHI-Detection Report for the datasets in 
/apps/sas/datasets/data2/MAVDW/data/lnvestigators/cbredfeldt/test_phi. 

please inspect all output carefully to make sure it comports with your data sharing agreement! !! 



filetype 




Directory 


2 


sas7bdat 


3 



List of all files in the transfer directory 



filename 


filetype 


Numberof 

Physical 

Observations 


Date Created 


Date Modified 


out 


Directory 








test 


Directory 








transferl.sas7bdat 


sas7bdat 


20 


09SEP11:11:56:49 


09SEP11:11:56:49 


transfer2.sas7bdat 


sas7bdat 


20 


09SEP11:11:56:49 


09SEP11:11:56:49 


transfer3.sas7bdat 


sas7bdat 


20 


09SEP11:11:56:49 


09SEP11:11:56:49 



WARNINGS for dataset transferl 



variable 


warning 


BIRTH_DATE 


If this is a date, at least one value is 90 years ago, which is older than 89. If this date applies to a person, 
the record is probably PHI. 


BIRTH_DATE 


Namesuggests this var may be a date of birth. 


MRN 


Could this var hold personal identifiers? Contents of 20 records match the pattern given for personal 
identifiers. Identifiers should never move across sites. 


MRN 


Namesuggests this var may be an MRN, which should never move across sites. 


5SN 


Could this var hold personal identifiers? Contents of 20 records match the pattern given for personal 
identifiers. Identifiers should never move across sites. 


SSN 


Namesuggests this var may be a social security number. 



Sample Observations: please review variable names and data for possible PHI 



Row 


MRN 


STUDYID 


sid 


SSN 


BIRTH_DATE 


pop_size 


1 


A11292591 


A11292591 


A11292591 


112-92-5855 


10/01/1921 


5 


2 


A27027021 


A27027021 


A27027021 


270-27-0150 


09/29/1923 


8 


3 


A61919731 


A61919731 


A61919731 


619-19-7249 


09/29/1923 


3 


4 


A72055681 


A72055681 


A72055681 


720-55-6751 


10/02/1926 


2 


5 


A82346489 


A82346489 


A82346489 


823-46-4831 


10/02/1926 


8 



Minimum values of numeric-type variables: Please check for small cell sizes where appropriate. 



Variable 


Minimum 


BIRTH_DATE 


10/01/1921 


Pop_size 


2 



Figure 1 Example of a report from the PHI detection macro, (a) High level analysis showing the number of files in the transfer directory, by 
file type, (b) Detailed report of potential PHI in the sas data set files in the transfer directory (top), as well as example records (middle) and 
minimum values of numerical variables. 



records with numerical values that may indicate critic- 
ally small populations. The sample records allow the 
data reviewer to examine specific examples to determine 
if the records genuinely contain PHI, and prompt the 
reviewer to check any PHI found in the data set against 
the IRB and data sharing agreements. 

The final subsection identifies the minimum value of 
all numeric variables in the data set. This analysis 
assumes that numeric variables may represent counts of 
events or members of a population. If the minimum 
value of a count variable is zero, it may represent miss- 
ing data and indicate that there is an error in the code 
that needs to be reviewed. In addition, according to 
HIPAA, populations with between 1 and 5 people may 



be considered PHI if they have rare conditions because 
it is relatively easy to identify the individuals in the 
population. Thus, if the minimum value of a numeric 
variable is between 1 and 5 (inclusive), that variable 
should be inspected to ensure that either it does not 
represent a population count, or low population counts 
are allowable in the transfer data sets based on the IRB 
approval and data sharing agreements. 

Performance: We first evaluated the PHI Detection 
macro on three test data sets containing a variety of PHI, 
including medical record numbers (MRNs), social security 
numbers, ages greater than 89, birth dates, and small pop- 
ulations. The PHI report correctly listed all 20 fields that 
either contained PHI or were likely to contain PHI based 
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on the field name (one field labeled "MRN" contained 
study-specific identifiers, and was flagged by the report). 

We next evaluated the PHI Detection macro on 100 
research data sets, comparing the PHI detected by the 
macro to that found on manual review. The test data 
sets were drawn from a combination of site-specific data 
(likely to contain PHI) and collaborative data (should 
not contain extensive PHI). Each test data set was 
restricted to the top 50 records to improve accuracy of the 
manual review process. Manual review was performed by 
two experienced research analysts to ensure accuracy. The 
PHI Detection macro correctly identified 111 out of 113 
instances of PHI in the test data sets, for a recall of 0.98. 
There were also 26 false positives out of a total of 809 data 
fields, for a specificity of 0.96. The macro has a precision 
of 0.81, and an F-score of 0.88. 

Discussion 

We developed a SAS macro program to identify poten- 
tial PHI in collaborative data sets. The macro scans all 
data sets in a given directory for variable names, data 
patterns, and numeric values that may represent PHI. 
Metadata about the data sets in the directory is printed 
to a PDF report, along with any warnings identified in 
the data. The analyst or investigator can then use the 
report to determine if the research data sets contain 
unauthorized PHI. 

Managing PHI carefully in multi-site research environ- 
ments is critical to protecting our patients and comply- 
ing with federal laws. In this paper we describe an 
approach for identifying PHI in collaborative research 
environments that work primarily with structured data, 
such as utilization records, lab data and patient vitals. 
Our approach uses a relatively simple pattern matching 
method that leverages the metadata contained in struc- 
tured data fields through two techniques: 1) pattern 
matching on field names; and 2) using field data types to 
identify dates. We further use regular expressions to find 
fields containing distinctive identifiers such as medical 
record numbers and social security numbers. By lever- 
aging the structure of the data sets, we are able to use a 
relatively small dictionary of 21 terms that can be 
customized as needed for each site. Despite the limited 
data dictionary, the macro has high sensitivity and speci- 
ficity in identifying common PHI elements such as med- 
ical record numbers, dates and patient names. 

The majority of previous efforts at PHI protection for 
multi-site research have focused on de-identifying or an- 
onymizing free-text documents such as pathology reports 
and progress notes (see [9,10]). Free text can contain com- 
plex forms of PHI such as proper names, making them 
much more difficult to scrub. These efforts have led to the 
development of more sophisticated machine learning and 
lexical, heuristic and pattern based methods for identifying 



PHI. These methods often require a large corpus to train 
the machine learning algorithms, or an extensive diction- 
ary to support the pattern matching approach [9]. In cases 
where multi-site research is conducted entirely on struc- 
tured data where there is less variability in the data and a 
more limited set of potential PHI elements, these ap- 
proaches may be more sophisticated and resource inten- 
sive than is necessary. A simple method that quickly and 
accurately scans a large body of data to produce a PHI re- 
port could significantly improve the probability of identi- 
fying unauthorized PHI prior to transfer. 

Our approach differs from many PHI-protection 
approaches in that it focuses on identifying, but not 
removing, potential PHI. Previous privacy protection 
methods on structured data have focused on anonymi- 
zing the data by removing data points until individuals 
are statistically indistinguishable [12]. However, this 
method is open to reverse engineering and may remove 
critical data attributes [13]. In this work, we start from 
the perspective that many multi-site research projects 
have IRB approval to share certain types of PHI neces- 
sary for adequately answering the research question. The 
aim with this work was to generate a report that would 
allow a human reviewer to determine whether any PHI 
identified in the research data sets can be shared with 
other research sites based on both IRB and DUAs. 

The macro is not meant to replace human review of 
the data sets. Rather, it is intended to increase the 
efficiency and completeness of the data review. Manual 
review can be tedious and inaccurate for research pro- 
jects containing multiple data sets with thousands to 
millions of records each. By explicitly generating war- 
nings regarding potential PHI elements in the data sets, 
the macro draws attention to data that may need closer 
review prior to release. In addition, the PHI detection 
macro is able to review every record for inappropriate 
dates, ages or counts, which can be impossible for a 
human to do in a large, complex data set. Finally, by 
providing a clear list of all files in the transfer directory, 
the macro makes it easier to confirm that only the 
intended files are included in the transfer directory. 

As with other PHI-protection procedures, the macros 
are only effective if multi-site programmers use them 
properly. The four most common problems that may 
reduce the effectiveness of the macros is specifying the 
wrong directory for data review, failing to adequately 
specify the regular expressions used to define key identi- 
fiers, failing to review the reports produced by the macros, 
and failing to run the macros prior to transferring the 
data. Three of these four problems can be addressed by 
using a PHI checklist for final review. Such a checklist 
would prompt the data reviewer to compare the informa- 
tion in the PHI report to the information in the program s 
workplan to ensure the transfer directory contains the 
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right data and any PHI identified in the report is allowable 
under the terms of the IRB approval and DUAs. The 
fourth problem (incorrect specification of the regular ex- 
pressions) can be addressed by testing the regular expres- 
sion definition against some sample data using publicly 
available tools such as http://www.regextester.com/. 

Conclusions 

The PHI Protection macros described in this paper are 
intended to reduce accidental PHI disclosure in multi- 
site research using structured data by automating the re- 
view of shared data sets. The automated review provides 
a report describing all files in the transfer directory, 
including warnings if it finds common indicators of PHI 
in either the data sets or the programming logs. When 
used in conjunction with a careful, checklist-guided 
manual review of the data for unauthorized PHI, the 
macro has the potential to significantly reduce accidental 
PHI disclosures. 
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